Abstract

Although many sophisticated parallel algorithms now exist, it is
not at all clear whether any of them is sensitive to properties of
the input which can be determined only at run time. For example,
in the case of parallel addition in shared memory models, we
intuitively understand that we should not add those inputs whose
value is zero. A technique which exploits this idea may beat the
general lower bound for addition if the count of nonzero operands
is much smaller than the count of numbers to be added. In this
paper, we devise such algorithms for two fundamental problems of
parallel computation. Our model of computation is the CRCW PRAM.
We first provide a randomized algorithm for parallel addition which
never errs and computes the result in $O(\log m)$ expected parallel
time, where $m$ is the count of nonzero entries among the $n$
numbers to be added. This algorithm uses $O(m)$ shared space. We
then use this result to solve the following problem of processor
identification: $n$ processors are given, each keeping either a 0
or a 1. We want each processor, at the end, to know which are the
processors with the 1's. Our solution is randomized and sensitive
to the number $m$ of the 1's. It takes
$O(\min\{m,\ n \log m / \log n\})$ expected parallel time and only
$O(m)$ shared memory, capable of holding only $O(n)$ size numbers.

Combinatorial techniques of Erdős and Rényi were helpful for part
of this second result.

All our techniques enjoy the following properties:

(1)  They never produce an erroneous answer.

(2)  If $T$ is the actual parallel time and $E(T)$ its expected
value, then $\mathrm{Prob}[T > k \cdot E(T)] \le n^{-c}$, where $k$
is arbitrary and $c > 1$ is linear in $k$ and can be specified by
the algorithm implementer.

(3)  $m$ is initially unknown to our algorithms. They produce an
accurate estimate of it.
1.   Introduction

Recently there has been much interest in fast parallel algorithms
that employ randomization. Although many sophisticated such
algorithms now exist (see e.g. the proceedings of the 16th STOC
Conference), it is not at all clear whether any of them are
sensitive to properties of the input which can be determined only
at run time. For example, in the case of parallel addition (or
multiplication) in shared memory models, we understand intuitively
that we should not add (multiply) those inputs whose value is zero
(one). Even if we manage to quickly estimate the number of nonzero
inputs, we still have to organize them in an appropriate manner
(e.g. pack them in shared memory) in order to perform the addition.
Our goal here is to devise algorithms (by use of randomization)
which are sensitive to such dynamic properties of the input and
hence beat the known lower bounds (which hold for the general
case).

We use the synchronous Concurrent Read - Concurrent Write (CRCW)
model of parallel computation, called WRAM (see e.g. [SV, 80],
[G, 78]). This model assumes the presence of a (potentially)
unlimited number of processors with (potentially) unlimited local
memory in each processor. We assume our processors are capable of
making independent probabilistic choices on a fixed input (this was
first used by [R, 82a,b] and [V, 83]). WRAM is like the PRAM of
[W, 79] and [FW, 78] in the sense that different processors can
read the same memory location at the same time. However, in the
case of a simultaneous write attempt, exactly one processor
succeeds in the WRAM model. We make no assumption about which one
succeeds, but we assume that the failed ones are notified. This can
be easily implemented by having processors read the result of the
"write".

We first consider the fundamental problem of parallel addition of
$n$ numbers. Our technique first provides a probabilistic estimate
of the count ($m$) of the nonzero inputs, and then uses a
probabilistic method to lay them out in shared memory and add them.
The whole algorithm takes $O(\log m)$ expected parallel time, uses
$O(m)$ shared space and involves only $m$ processors. To our
knowledge, deterministic WRAM algorithms for addition have to take
$O(\log n)$ parallel time when at most $n$ processors are used and
$n$ numbers are to be added.

We then examine the related problem of processor identification:
$n$ processors of a WRAM are given, each keeping either a 0 or a 1,
and each processor must find out which are the processors with the
1's. We assume that each shared memory location can fit a number of
at most $O(n)$ size. We first use a nice technique of Erdős and
Rényi (ER, 63) to provide an $O(n/\log n)$ parallel time solution
to the problem, assuming that counting $O(n)$ units has unit cost.
We then use our first result (about addition) to provide an
$O(\min\{m,\ n \log m / \log n\})$ expected parallel time algorithm
for the WRAM which uses only $O(m)$ shared space. We also give a
matching lower bound for the parallel time.

All our results satisfy the following: if $T$ is the actual
parallel time of our algorithm and $E(T)$ is its expected value,
then $\mathrm{Prob}[T > k \cdot E(T)] \le n^{-c}$, where $k > 1$ is
any constant and $c > 1$ grows linearly with $k$ and can be
controlled by the algorithm designer.


2.   The case of parallel addition

2.1.   The Algorithm

Let the array $M$ represent the shared memory. Let $a > 1$ be a
positive integer constant. Let each processor $P_i$ be equipped
with a local variable, $\mathrm{TIME}_i$, intended to keep the
current parallel step. Initially, each processor $P_i$
($1 \le i \le n$) holds locally a number $x_i$. The goal is to
compute the sum of the $x_i$'s. Let $m$ be the number of the
nonzero $x_i$'s. We give the algorithm in two parts: Procedure
ADDITION(m') actually performs the addition, assuming an estimate
$m' = cm + d$ ($c, d > 1$ constants) is known. Function ESTIMATION
produces such an estimate. So, the whole algorithm has the
following high level description:

begin
    m' ← ESTIMATION
    ADDITION(m')
end

We provide the description of ESTIMATION first. In ESTIMATION,
each $P_i$ with $x_i \ne 0$ produces $k$ estimates of $m$ ($k$ is
a constant) through a probabilistic technique, and then does a
variance-reduction process to get the final estimate. The actual
value of $k$ is determined in the analysis.

Function ESTIMATION

procedure PRODUCE-AN-ESTIMATE

begin

  stage 1 (Initialization)

    Processor P_1 initializes a special shared memory location
    (CLOCK) to zero. Then, each P_i executes TIME_i ← 0.

  stage 2 (Estimate)

    Processor P_i:

    if x_i ≠ 0 then
    begin
      (1)  Flip a fair coin (two-sided)
      (2)  If the outcome is 'tail' then
           begin
             (2a)  TIME_i ← TIME_i + 1
             (2b)  CLOCK ← TIME_i
             (2c)  go to (1)
           end
    end

    comment: This is done by processors which flipped a 'head'

      (3)  If x_i ≠ 0 then
           begin
             (4)  read CLOCK into a local variable R1
             (5)  wait for 5 steps
             (6)  read CLOCK into a local variable R2
             (7)  If R1 ≠ R2 then go to (4)
           end

    comment: At this point, every P_i with x_i ≠ 0 has flipped a
    'head'

      (8)  Each P_i with x_i ≠ 0 reads CLOCK and makes its value
           the current estimate.

end (of procedure PRODUCE-AN-ESTIMATE)

begin (main part of ESTIMATION)

  Each P_i with x_i ≠ 0 runs procedure PRODUCE-AN-ESTIMATE k times
  and produces estimates E_1, E_2, ..., E_k. Then all compute

    (1)  E ← (log 2) · (E_1 + ... + E_k) / k

    (2)  m' ← exp(2) · exp(E) + d

  where d > 1 is a constant.

  m' is the value returned by ESTIMATION. We assume that it is
  written to a special shared memory location, so that it is
  available to all processors.

end
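
As an illustration, ESTIMATION can be simulated sequentially. The
sketch below is not part of the algorithm's specification. It
follows the pseudocode literally, so CLOCK ends up holding the
largest number of tails seen by any processor (one less than the
flips-until-head count used in the analysis). The function names,
the defaults k = 16 and d = 2, and the seed parameter are
assumptions made only for this example.

```python
import math
import random


def produce_an_estimate(m, rng):
    """Return the final CLOCK value for m processors with nonzero input.

    Every tail makes a processor write its step counter to CLOCK, so the
    value left in CLOCK is the largest tail count over all processors.
    """
    clock = 0
    for _ in range(m):
        tails = 0
        while rng.random() < 0.5:  # 'tail' with probability 1/2
            tails += 1
        clock = max(clock, tails)
    return clock


def estimation(m, k=16, d=2, seed=0):
    """Average k CLOCK values and turn them into the estimate m'."""
    rng = random.Random(seed)
    estimates = [produce_an_estimate(m, rng) for _ in range(k)]
    e = math.log(2) * sum(estimates) / k      # step (1)
    return math.exp(2) * math.exp(e) + d      # step (2)


if __name__ == "__main__":
    for m in (10, 100, 1000):
        print(m, round(estimation(m), 1))
```

Larger k reduces the spread of the averaged value E at the cost of
more repetitions of PRODUCE-AN-ESTIMATE.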


We now provide a description of procedure ADDITION(m'). It has
3 stages:

PROCEDURE ADDITION (m')

Stage 1 (Initialization)

  In one parallel step, processors initialize am'+2 shared memory
  locations to zero, by executing: "Processor P_j writes a zero to
  M(j), 1 ≤ j ≤ am'+2." Then, they all execute TIME_i ← 0.

Stage 2 (Memory Marking)

  Processor P_i:

  IF x_i ≠ 0 then
  BEGIN
    (1)  Select y equiprobably at random from {1, 2, ..., am'}
    (2)  TIME_i ← TIME_i + 1
    (3)  Read M(y); TIME_i ← TIME_i + 1
    (4)  If M(y) = 0 then write x_i into M(y).
         Also, TIME_i ← TIME_i + 1
    (5)  If the "write" failed then
         BEGIN
           (5a)  write TIME_i into M(am'+1)
           (5b)  go to (1)
         END
  END

  Comment: This part is executed by P_i with x_i = 0 and by
  "successful" P_i with x_i ≠ 0.

    (6)  Read M(am'+1) into a local variable R1
    (7)  Wait for 8 steps
    (8)  Read M(am'+1) into a local variable R2
    (9)  If R1 ≠ R2 then go to (6)

  Comment: R1 = R2 means all processors with x_i ≠ 0 succeeded in
  writing x_i in a shared memory location, different for each
  processor, among M(1), ..., M(am'). (If a processor was failing,
  the value of M(am'+1) would change.)
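
The memory-marking stage can be sketched as a round-by-round
sequential simulation. This is a minimal sketch under stated
assumptions: the function name memory_marking, the default a = 2,
the seed parameter and the use of a random choice to pick the
winning writer are ours (the model only says that exactly one
writer succeeds). The termination-detection steps (6)-(9) are not
modelled.

```python
import random


def memory_marking(x, m_prime, a=2, seed=0):
    """Place every nonzero x[i] into its own cell among a*m_prime cells.

    Returns the cells M(1..a*m_prime) and the number of rounds used.
    """
    rng = random.Random(seed)
    size = a * m_prime
    pending = [i for i, xi in enumerate(x) if xi != 0]
    if len(pending) > size:
        # With too small an estimate the real algorithm would loop forever.
        raise ValueError("a*m_prime is smaller than the number of nonzeros")
    memory = [0] * (size + 1)          # memory[1..size]; index 0 unused
    rounds = 0
    while pending:
        rounds += 1
        choices = {}
        for i in pending:              # step (1): pick a random location
            choices.setdefault(rng.randint(1, size), []).append(i)
        still_pending = []
        for y, procs in choices.items():
            if memory[y] == 0:         # step (4): cell is free
                winner = rng.choice(procs)   # exactly one write succeeds
                memory[y] = x[winner]
                still_pending.extend(p for p in procs if p != winner)
            else:                      # cell already occupied: retry
                still_pending.extend(procs)
        pending = still_pending
    return memory[1:], rounds


if __name__ == "__main__":
    cells, rounds = memory_marking([0, 3, 0, 5, 7, 0, 0, 1], m_prime=4)
    print(cells, rounds)
```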

Stage 3 (Addition)

  (Processor P_j is assigned to location M(j), 1 ≤ j ≤ am'.)

  From this point on, processors P_j (where 1 ≤ j ≤ am') perform a
  standard parallel addition of the numbers M(1), ..., M(am'). In
  the i-th parallel step of the addition (i = 1, 2, ...), processor
  P_j adds M(j) and M(j + 2^(i-1)) into M(j), for j = k·2^i + 1,
  k = 0, 1, 2, ..., ≤ am'/2^i. (See e.g. (K, 82) or (FW, 78) on how
  to do the parallel addition of am' numbers by am' processors in
  O(m') space and O(log m') parallel time.)
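
For concreteness, the pairwise (balanced tree) summation that
Stage 3 refers to can be sketched as follows. The function name
tree_sum and the 0-based indexing are assumptions of this example.
Each pass of the outer loop stands for one parallel step, in which
all the additions of the inner loop would happen simultaneously.

```python
def tree_sum(cells):
    """Sum the cells by repeated pairwise addition.

    Returns (total, steps), where steps is the number of parallel steps.
    """
    m = list(cells)
    size = len(m)
    steps = 0
    stride = 1                      # distance between the two operands
    while stride < size:
        # Cells 0, 2*stride, 4*stride, ... each absorb their partner.
        for j in range(0, size - stride, 2 * stride):
            m[j] += m[j + stride]
        stride *= 2
        steps += 1
    return (m[0] if m else 0), steps


if __name__ == "__main__":
    print(tree_sum([1, 2, 3, 4, 5]))
```

The number of steps is the ceiling of the base-2 logarithm of the
number of cells, which is where the logarithmic parallel time of
this stage comes from.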


2.2.   Analysis of the Algorithm

Lemma 1   At the end of each execution of procedure
PRODUCE-AN-ESTIMATE, the variable CLOCK is a random variable whose
mean and variance satisfy:

  (1)  $E(\mathrm{CLOCK}) \cdot \log 2 > \log m + 0.5$

  (2)  $(E(\mathrm{CLOCK}) - 1) \cdot \log 2 \le \log m + 0.5$

  (3)  $\mathrm{var}(\mathrm{CLOCK}) \le 4$

Sketch of Proof (Details in full paper)

CLOCK is the maximum of $m$ independent geometric random variables
$X_1, \ldots, X_m$ (the number of coin flips until a 'head' for the
$P_i$'s with $x_i \ne 0$) with density
$\mathrm{Prob}\{X_i = j\} = (1/2)^j$, $j \ge 1$. The rest is a
relatively easy calculation, since
$\mathrm{Prob}[\mathrm{CLOCK} < j] = (\mathrm{Prob}\{X_i < j\})^m$.
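
The product formula for the distribution of the maximum makes the
moments easy to evaluate numerically. The sketch below is an
illustration only. It assumes the geometric variables take values
1, 2, ... with Prob{X = j} = (1/2)^j, truncates the series after a
fixed number of terms, and uses the hypothetical name
clock_moments.

```python
def clock_moments(m, terms=200):
    """Mean and variance of the maximum of m iid geometric(1/2) variables.

    Uses P[max >= j] = 1 - (1 - 2**-(j-1))**m together with
    E[X] = sum_j P[X >= j] and E[X^2] = sum_j (2j - 1) P[X >= j].
    """
    mean = 0.0
    second = 0.0
    for j in range(1, terms + 1):
        tail = 1.0 - (1.0 - 2.0 ** -(j - 1)) ** m
        mean += tail
        second += (2 * j - 1) * tail
    return mean, second - mean ** 2


if __name__ == "__main__":
    for m in (1, 2, 10, 1000):
        print(m, clock_moments(m))
```

For a single variable (m = 1) this returns mean 2 and variance 2,
the familiar values for a fair-coin geometric distribution.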

Lemma 2   Given any $\delta > 0$, if we choose $k \ge 4/\delta$
then, with probability at least $1 - \delta$, we have

  (1)  $|E - \log m| \le 2$

and

  (2)  the total running time of ESTIMATION is
       $O\left(\frac{4}{\delta} \cdot \log m\right)$.

Proof sketch (Complete proof in full paper)

From the Chebyshev inequality and Lemma 1 we get
$\mathrm{Prob}[\,|E - \log m| \le 2\,] \ge 1 - 4/k \ge 1 - \delta$.
Also note that the running time of ESTIMATION is
$O(k \cdot E) = O(E_1 + \ldots + E_k)$.

Corollary 1   Given any $\delta > 0$, if we choose $k \ge 4/\delta$
then, with probability at least $1 - \delta$, we have

  $m \le m' \le m \cdot \exp(4)$

Proof   It follows immediately from Lemma 2.
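
The step from Lemma 2 to the corollary can be spelled out as
follows. This is our reading of the argument, using natural
logarithms and the definition of m' from ESTIMATION.

```latex
|E - \log m| \le 2
\;\Longrightarrow\; \log m \le E + 2 \le \log m + 4
\;\Longrightarrow\; m \le e^{2} e^{E} \le m \, e^{4} .
```

Since $m' = e^{2} e^{E} + d$ with $d$ a positive constant, the
lower bound $m \le m'$ is immediate, and the upper bound holds up
to the additive constant $d$.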

In the following we assume $k \ge 4/\delta$ for a fixed small
$\delta$.

Lemma 3   Conditioned on the event
$\varepsilon = [\,m \le m' \le m \cdot \exp(4)\,]$, the time of
stage 2 of procedure ADDITION(m') has an expected value of
$O(\log m)$. Furthermore, the (conditional on $\varepsilon$)
probability that the time of stage 2 exceeds $\beta \log m$ is
$\le m^{-\beta \log a + 1}$ (and can be made arbitrarily small).

Proof sketch (See full paper for details)

It is easy to see that every time a processor $P_i$ attempts to
write its $x_i$, and if $g \le m$ shared memory locations are
already "occupied", the competitors of $P_i$ are $m - g - 1$. Even
if all of them manage to select different memory locations which
were not occupied previously, the maximum number of locations that
$P_i$ must "avoid" is $g + m - g - 1 = m - 1$. So, $P_i$ will
succeed with probability at least

  $\frac{am' - (m-1)}{am'} \ge \frac{am' - (m'-1)}{am'} \ge 1 - \frac{1}{a}$

in each trial (and this holds for every $P_i$).

A generalization of Lemma 1 about the maximum of $m$ geometrics
with success probability $\ge 1 - 1/a$ implies that the average
number of parallel steps required for all $m$ processors to succeed
is $O(\log m') = O(\log m)$.

The probability that there exists a processor which continues
failing for at least $\beta \log m$ rounds is

  $\le m \cdot \left(\frac{1}{a}\right)^{\beta \log m} = m^{-\beta \log a + 1}$
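
To see how the constants interact, the union bound above can be
evaluated numerically. The sketch assumes natural logarithms, and
the names failure_bound and beta_for_target are ours. It only
evaluates the stated bound and says nothing about how tight it is.

```python
import math


def failure_bound(m, a, beta):
    """Union bound m * (1/a)**(beta * log m) on the failure probability."""
    return m * (1.0 / a) ** (beta * math.log(m))


def beta_for_target(m, a, target):
    """Smallest beta for which the bound drops to `target` (m > 1, a > 1)."""
    return (1.0 - math.log(target) / math.log(m)) / math.log(a)


if __name__ == "__main__":
    m, a = 1000, 2
    beta = beta_for_target(m, a, 1e-6)
    print(beta, failure_bound(m, a, beta))
```

Increasing either a (more shared memory) or beta (more rounds)
drives the bound down.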


It is easy to see that the algorithm uses $O(m')$ shared memory and
$O(m')$ processors, and performs the addition correctly, because at
the end of stage 2 of ADDITION(m') the $m$ nonzero $x_i$'s are
placed one in each of $m$ shared memory locations, and these
locations are among $M(1), \ldots, M(am')$. The rest of these
locations contain zeros. So, we have:

Lemma 4   Given any $\delta \in (0,1)$, we can choose $\beta > 0$
such that with probability at least
$1 - \max(\delta,\ m^{-\beta \log a + 1})$ our algorithm performs
the parallel addition in $O(\log m)$ time and uses $O(m)$ shared
space and $O(m)$ processors. Our algorithm never errs. With
diminishingly small probability, it may choose a bad estimate $m'$
of $m$ and hence it may never exit the loop (6)-(9) of stage 2 of
ADDITION(m').


3.  The processor identification problem.

3.1.  An $O(n/\log n)$ parallel time algorithm which assumes
unit-cost addition
The processor identification problem assumes that $n$ processors
are given, each keeping either a 0 or a 1. The problem is for each
processor to find out which are the processors with the 1's. We
first solve this problem for the so-called strong W-RAM (SW-RAM)
model. This model has the property that simultaneous writes on the
same memory location succeed only if they write the same value,
and, if that is the case, their sum is recorded. We also assume
that each shared memory location can hold only up to $O(n)$ size
numbers. Let us imagine that all the processors are equipped with
the same list $L = l_1, l_2, \ldots, l_s$ of "testing" sequences,
where each $l_i$ is an $n$-bit sequence of 0's and 1's. Let us also
assume that $L$ is independent of the particular assignment of 0's
and 1's to processors. In the following, let $v$ be a fixed memory
location and let $e_j$ be the value of processor $P_j$. Processors
execute the following sequence of steps, $s$ times:

Round $i$ ($1 \le i \le s$)

  (1)  $P_1$ erases $v$'s contents by writing a 0.

  (2)  Each $P_j$ ($1 \le j \le n$) looks at the $j$-th position of
       $l_i$. If $l_i(j) = 1$ and $e_j = 1$ then $P_j$ writes $e_j$
       to location $v$.

  (3)  Each $P_j$ reads $v$'s contents.

At the end of the $s$ rounds, each processor has, for each testing
sequence, the number of places in which a 1 stands both in the
testing sequence and in the sequence $e_1 e_2 \ldots e_n$ to be
guessed. If $L$ allows each processor to find $e_1 \ldots e_n$
after the $s$ rounds, we call $L$ an $s$-algorithm for the
processor identification problem. (We allow unrestricted local
memory per processor.)

An obvious $L$ (which would take $O(n)$ parallel time) is that
consisting of $n$ $l_i$'s with $l_i(j) = 0$ for $i \ne j$ and
$l_i(i) = 1$ for all $i$.
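
The rounds of the SW-RAM procedure, and the obvious choice of L,
can be illustrated with a short sequential sketch. The names
run_rounds and identity_tests are hypothetical. The summing write
of the SW-RAM is modelled by counting the processors that write in
each round.

```python
def run_rounds(L, e):
    """Return the value every processor reads from v in each round.

    L is a list of 0/1 testing sequences, e the 0/1 values of the
    processors. In round i the writers are the processors j with
    L[i][j] == 1 and e[j] == 1, and the SW-RAM records their sum.
    """
    readings = []
    for row in L:
        v = sum(1 for lij, ej in zip(row, e) if lij == 1 and ej == 1)
        readings.append(v)
    return readings


def identity_tests(n):
    """The obvious L: testing sequence i has a single 1, at position i."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    e = [0, 1, 1, 0]
    print(run_rounds(identity_tests(4), e))         # one round per processor
    print(run_rounds([[1, 1, 1, 1], [1, 1, 0, 0]], e))
```

With the identity choice, the readings are simply the values
e_1, ..., e_n, one per round, which is why n rounds suffice.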

Erdős and Rényi (ER, 63) considered a very closely related problem,
the "coin-weight" problem. Using their techniques, we show that the
$s$ needed is $\Theta(n/\log n)$ and that $L$ can be easily
constructed.

Let us view $L$ as an $s \times n$ matrix of 0's and 1's.

Lemma 5   (see also (ER, 63)). An $s \times n$ matrix $L$ of 0's
and 1's is an $s$-algorithm for the processor identification
problem iff: for each pair $c, c'$ of subsets of the set $C$ of
columns of $L$ such that $c \ne c'$, if we form the row-sums of the
submatrices $L(c)$ and $L(c')$ (consisting of the selected columns)
and denote by $V_c$ and $V_{c'}$ the column-vectors consisting of
these row-sums, then $V_c \ne V_{c'}$.

Proof sketch   After $s$ rounds, each processor has a row-sum
vector, $V_c$, of $L$. This corresponds to just one subset $c$ of
the set of columns of $L$. This subset determines the processors
with the 1's, because $c$ is exactly the subset of processors with
a value equal to 1.
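
For small n the condition of Lemma 5 can be checked by brute
force, by enumerating all subsets of columns and verifying that
their row-sum vectors are pairwise distinct. The function name
is_s_algorithm is an assumption of this sketch, and the check takes
time exponential in n, so it is meant only as an illustration.

```python
from itertools import product


def is_s_algorithm(L):
    """True iff distinct column subsets of L give distinct row-sum vectors."""
    n = len(L[0])
    seen = set()
    for e in product((0, 1), repeat=n):   # e is the indicator of a subset c
        v = tuple(sum(l * x for l, x in zip(row, e)) for row in L)
        if v in seen:
            return False
        seen.add(v)
    return True


if __name__ == "__main__":
    print(is_s_algorithm([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))  # identity
    print(is_s_algorithm([[1, 1, 0], [1, 0, 1]]))
```

In the second example the subsets {1} and {2, 3} both give the
row-sum vector (1, 1), so that matrix is not a 2-algorithm.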


Here, the techniques of (ER, 63) can be used to prove

Lemma 6   An $s \times n$ matrix $L$, with $s = an/\log n$,
$a \ge 6$, chosen so that the $sn$ entries are independent random
variables each taking on the values 0, 1 with probability 1/2, is
an $s$-algorithm with probability tending to 1 as $n \to +\infty$.

Proof sketch   Let
$p = \mathrm{Prob}[L \text{ is an } s\text{-algorithm}]$. Let
$q = 1 - p$. Let $E(c_1, c_2)$, where $c_1, c_2$ are subsets of the
set of columns $C$ of $L$, denote the event that
$v_{c_1} = v_{c_2}$ (where $v_{c_i}$ is the row-sum vector of
$L(c_i)$). If $c_1, c_2$ are not disjoint, then if
$d_1 = c_1 - c_1 \cap c_2$ and $d_2 = c_2 - c_1 \cap c_2$, we have
$v_{d_1} = v_{d_2}$. Hence, if $L$ is not an $s$-algorithm, there
exist disjoint subsets $d_1, d_2$ of the set of columns such that
$v_{d_1} = v_{d_2}$. So

  $q \le \sum \mathrm{Prob}[E(d_1, d_2)]$

where the sum is over $d_1, d_2$ disjoint subsets of $C$. One can
then get, by some combinatorics, that

  $q \le 2^{n(\log 3 - a/2) + o(n)}$   for $s = \frac{an}{\log n}$.

So, if we choose $a > \log_2 9 + 2$ then we get $q \le 2^{-n}$ and
$\lim_{n \to \infty} q = 0$.
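
The role of the threshold on $a$ can be made explicit. The
following short derivation is ours. It assumes base-2 logarithms
and reads the condition as $a > \log_2 9 + 2$.

```latex
a > \log_2 9 + 2 = 2\log_2 3 + 2
\;\Longrightarrow\; \log_2 3 - \frac{a}{2} < -1
\;\Longrightarrow\; n\left(\log_2 3 - \frac{a}{2}\right) < -n .
```

Since $\log_2 9 + 2 \approx 5.17$, any $a \ge 6$ satisfies the
condition.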
Corollary   There exists an $s$-algorithm for $s = an/\log n$,
$a \ge 6$.

Proof   Since, from Lemma 6, $q < 2^{-n}$, we get
$p > 1 - 2^{-n} > 0$. (In fact the vast majority of random 0-1
$s \times n$ matrices are $s$-algorithms.)


3.2.  An $O(\min\{m,\ n \log m / \log n\})$ algorithm for the WRAM.

In the following, let $m$ be the number of ones among the
$e_1, e_2, \ldots, e_n$.
(a)   An $O(m)$ parallel-time algorithm for identification.

THE MARKING ALGORITHM

(Stage 1)  The WRAM runs the algorithm for parallel addition once
(as explained in Section 2) for the values $e_j$ of the
processors. At the end of this process (which takes $O(\log m)$
time with high probability), each processor knows the number of
ones among $e_1, e_2, \ldots, e_n$.

(Stage 2)  The WRAM runs stages 1 (Initialization) and 2 (Memory
Marking) of the procedure ADDITION(m), with the following
modification: each time a processor marks a memory location, it
writes its id instead of its value. At the end of this stage, the
$m$ id's of the processors with nonzero values have been placed
"contiguously" in $M(1), M(2), \ldots, M(am)$.

(Stage 3)  Each processor reads the memory locations
$M(1), \ldots, M(am)$ in sequence.
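
The three stages of the marking algorithm can be sketched
sequentially as follows. This is an illustration under stated
assumptions: the name marking_identify, the default a = 2 and the
seed parameter are ours, Stage 1 is replaced by a direct count, and
an arbitrary competing writer is taken to succeed.

```python
import random


def marking_identify(e, a=2, seed=0):
    """Return (sorted ids of processors holding a 1, cells scanned)."""
    rng = random.Random(seed)
    ids = [j for j, ej in enumerate(e, start=1) if ej == 1]
    m = len(ids)                     # Stage 1: the count of ones
    cells = [0] * (a * m)            # M(1..am); 0 marks an empty cell
    pending = ids
    while pending:                   # Stage 2: write ids, not values
        claims = {}
        for pid in pending:
            claims.setdefault(rng.randrange(a * m), []).append(pid)
        pending = []
        for y, procs in claims.items():
            if cells[y] == 0:
                cells[y] = procs[0]          # one arbitrary writer succeeds
                pending.extend(procs[1:])
            else:
                pending.extend(procs)
    # Stage 3: every processor scans all a*m cells in sequence.
    return sorted(c for c in cells if c != 0), len(cells)


if __name__ == "__main__":
    print(marking_identify([0, 1, 1, 0, 0, 1]))
```

The second component of the result is the length of the final scan,
which is the term that dominates the running time.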

Lemma 7   The marking algorithm solves the identification problem
in the WRAM in $O(m)$ parallel time, with arbitrarily high
probability.

Proof sketch

By Lemma 4 of Section 2, the first stage of the algorithm takes
$O(\log m)$ time with probability at least
$1 - \max(\delta,\ m^{-\beta \log a + 1})$, where $\delta$ and
$\beta$ can be selected by the implementer. The second stage of the
algorithm takes $O(\log m)$ time with probability at least
$1 - m^{-\beta \log a + 1}$, by Lemma 3 of Section 2. The last
stage of our algorithm takes $am$ time. Our algorithm never reports
an erroneous answer. However, with diminishingly small probability,
it may never terminate.

(b)   An $O(n \log m / \log n)$ expected parallel time algorithm
for the WRAM.

The WRAM here will simulate the SW-RAM of Section 3.1, as follows:

(Stage 1)  The WRAM runs the algorithm for parallel addition once,
for the values $e_j$ of the processors. At the end of this process,
each processor knows the number of ones ($m$) among the $e_j$'s.

(Stage 2)  The WRAM runs the $s$-algorithm, by simulating step 2 of
each round with the procedure ADDITION(m) (described in Section 2).


Lemma 8   The simulation algorithm described above runs in
$O(n \log m / \log n)$ expected parallel time.

Proof sketch   Stage 1 runs in $O(\log m)$ expected time, by
Lemma 4 of Section 2. Each round of the $s$-algorithm also runs in
$O(\log m)$ expected time, by Lemma 3 of Section 2.

In  the  full  paper,  we  also  prove  that 

Lemma  9   If  m  =  *'(n)  ,  then  our  simulation  algorithm  runs  in 
0  (logm)  parallel  time,  with  probability  at  least      _, 

1-m   logqi 

(c)  The combination of the two techniques. We can have the WRAM
running both algorithms (a) and (b) interleaved (one parallel step
of (a), and then one parallel step of (b)). When one of the two
techniques terminates, the processors will stop.
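
The interleaving can be sketched with two step-wise procedures
modelled as Python generators, where each `yield` stands for one
parallel step. The names interleave and countdown are hypothetical,
and countdown is only a stand-in for algorithms (a) and (b).

```python
def interleave(alg_a, alg_b):
    """Alternate single steps of two algorithms and stop at the first result.

    Returns (result, total number of steps executed).
    """
    steps = 0
    while True:
        for alg in (alg_a, alg_b):
            steps += 1
            try:
                next(alg)                    # run one parallel step
            except StopIteration as done:    # this algorithm has finished
                return done.value, steps


def countdown(result, n):
    """Stand-in algorithm that needs n steps before returning `result`."""
    for _ in range(n):
        yield
    return result


if __name__ == "__main__":
    print(interleave(countdown("a", 5), countdown("b", 2)))
```

The total number of steps is about twice the step count of the
faster of the two, which is how the minimum of the two running
times arises up to a constant factor.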
4.   Remarks and Lower bounds

Lemma 10   No $s$-algorithm can have $s < \frac{n}{\log(n+1)}$.

Proof sketch   Each processor needs at least $\log(2^n) = n$
"pieces of information" to distinguish between the $2^n$ possible
assignments of 0's and 1's to processors. On the other hand, if $k$
processors attempt an addition (in step 2 of each round), the
amount of information obtained cannot exceed
$\log(k+1) \le \log(n+1)$, because the number of 1's among them is
0, 1, ..., or $k$. So, $s$ rounds can give at most
$s \cdot \log(n+1)$ pieces of information to each processor.
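
The counting argument translates directly into a numeric lower
bound on the number of rounds. The sketch assumes base-2 logarithms
(information measured in bits), and the name min_rounds is ours.

```python
import math


def min_rounds(n):
    """Smallest integer s with s * log2(n + 1) >= n."""
    return math.ceil(n / math.log2(n + 1))


if __name__ == "__main__":
    for n in (1, 3, 7, 15, 1023):
        print(n, min_rounds(n))
```

This is only the information-theoretic bound. It does not by itself
show that an s-algorithm with that many rounds exists.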


Remark

Once the processors have an $s$-algorithm $L$, they can construct a
table of the possible row-sum vectors $V_c$ and their corresponding
subsets $c$ of $L$. Then, given any instance of the identification
problem, they need $O(s)$ rounds to find $V_c$ and one (indexed)
table access to find $c$ and solve the problem. Another piece of
the preprocessing work is the construction of $L$ itself. It seems
to us that the $n$ processors of the WRAM will need
$\Theta(n^2/\log n)$ time to agree on a common random $L$. Clearly,
our algorithm of Section 3.1 and of Section 3.2 (b) becomes
practical in dynamic environments, where the values of the $n$
processors change. We pose as a general open problem the
construction of input-sensitive parallel algorithms for other
problems (so that the "general" lower bounds are beaten). A
possible candidate is graph connectivity for special types of
graphs.

Acknowledgments

The author wishes to thank C.H. Papadimitriou, D. Shasha and
Z. Kedem for helpful comments on previous versions of this work.



REFERENCES

(CLC, 83)
Chin, F., J. Lam and I. Chen, "Optimal Parallel Algorithms for the
Connected Components Problem," CACM 83.

(C, 80)
Cook, S., "Towards a Complexity Theory of Synchronous Parallel
Computations", Specker Symp. on Logic and Algorithms, Zurich,
Feb. 5-11, 1980.

(DNS, 81)
Dekel, E., D. Nassimi and S. Sahni, "Parallel Matrix and Graph
Algorithms," SIAM J. Comp. 10 (4), 1981.

(ER, 60)
Erdős, P. and A. Rényi, "On the Evolution of Random Graphs," The
Art of Counting, J. Spencer, Editor, MIT Press, 1973.

(ER, 63)
Erdős, P. and A. Rényi, "On two problems of information theory,"
Magyar Tud. Akad. Mat. Kut. Int. Közl. 8 (1963); also in The Art of
Counting, J. Spencer, Editor, MIT Press, 1973.

(G, 78)
Goldschlager, L., "A Unified Approach to Models of Synchronous
Parallel Machines", Proc. 11th STOC, May 1978.

(G, 77)
Goldschlager, L., "Synchronous Parallel Computation", Ph.D. thesis,
Univ. of Toronto, C.S. Dept., 1977.

(GLR, 80)
Gottlieb, A., B. Lubachevsky and L. Rudolph, "Basic Techniques for
the efficient coordination of very large numbers of cooperating
sequential processors," Courant Inst. TR No. 028, Dec. 1980.

(HCS, 79)
Hirschberg, D., A. Chandra, D. Sarwate, "Computing Connected
Components on Parallel Computers," CACM 22(8), Aug. 1979.

(K, 82)
Kucera, L., "Parallel Computation and Conflicts in Memory Access",
Info. Processing Letters Vol. 14, April 1982.

(MV, 83)
Mehlhorn, K., and U. Vishkin, "Randomized and deterministic
simulation of PRAMs by parallel machines with restricted
granularity of parallel memories," 9th Workshop on Graph Theoretic
Concepts in Computer Science, Univ. Osnabrück, June 1983.

(R, 82)
Reif, J., "Symmetric Complementation," 14th STOC, San Francisco,
CA, May 1982.

(R, 82b)
Reif, J., "On the Power of Probabilistic Choice in Synchronous
Parallel Computations", 9th ICALP, Aarhus, Denmark, July 1982.

(Ru, 79)
Ruzzo, W., "On Uniform Circuit Complexity", Proc. 20th FOCS,
Oct. 1979.

(SJ, 81)
Savage, C. and J. Ja'Ja', "Fast, Efficient Parallel Algorithms for
Some Graph Problems", SIAM J. Comp. 10 (4), Nov. 1981.

(SV, 80)
Shiloach, Y. and U. Vishkin, "Finding the Maximum, Merging and
Sorting in a Parallel Computation Model", Tech. Rep., Technion,
Israel, Comp. Sci., March 1980.

(SV, 82)
Shiloach, Y. and U. Vishkin, "An O(log n) Parallel Connectivity
Algorithm", J. of Algorithms, 1982.

(S, 80)
Schwartz, J.T., "Ultracomputers", ACM TOPLAS 1980, pp. 484-521.

(UW, 84)
Upfal, E. and A. Wigderson, "How to share memory in a distributed
system", 25th FOCS, Proceedings, October 1984.

(V, 83a)
Vishkin, U., "A parallel-design, distributed-implementation general
purpose parallel computer", to appear, J. TCS.

(V, 83b)
Vishkin, U., "Randomized speed-ups in parallel computation", 16th
STOC, April 1984, Proceedings.

(U, 84)
Upfal, E., "A probabilistic relation between desirable and feasible
models of parallel computation", 16th ACM STOC 1984, Proceedings.

(W, 79)
Wyllie, J., "The Complexity of Parallel Computation", Ph.D. Thesis,
Cornell University, 1979.

NYU COMPSCI TR-182
Spirakis, Paul G.
Input sensitive, optimal parallel randomized algorithms...



